ARROW-12597: [C++] Enable per-row-group parallelism in async Parquet reader#10482

Closed
lidavidm wants to merge 6 commits into apache:master from lidavidm:arrow-12597

@lidavidm lidavidm commented Jun 8, 2021

Member

This adds an `OptionalParallelForAsync` which lets us have per-row-group parallelism without nested parallelism in the async Parquet reader. This also uses `TransferAlways`, taking care of ARROW-12916. `enable_parallel_column_conversion` is kept as it still affects the threaded scanner.
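
To illustrate the idea of parallelizing across row groups without also parallelizing inside each one, here is a minimal standard-library sketch. It is not the code from this PR: `OptionalParallelMap` and `DecodeAll` are hypothetical names, `std::async` stands in for Arrow's executor and futures, and "decoding" is just doubling integers. The outer call runs one task per row group, while the inner per-column call is forced to run serially so that no nested parallelism occurs.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper (not Arrow's implementation): apply `func(index, item)`
// to every input, either with one std::async task per item
// (use_threads == true) or serially on the calling thread.
// Results are returned in input order.
template <typename T, typename Func>
auto OptionalParallelMap(bool use_threads, const std::vector<T>& inputs, Func func)
    -> std::vector<decltype(func(std::size_t{}, inputs[0]))> {
  using R = decltype(func(std::size_t{}, inputs[0]));
  std::vector<R> results;
  results.reserve(inputs.size());
  if (!use_threads) {
    // Serial path: no tasks are spawned at all.
    for (std::size_t i = 0; i < inputs.size(); ++i) {
      results.push_back(func(i, inputs[i]));
    }
    return results;
  }
  // Parallel path: launch one task per input, then collect in order.
  std::vector<std::future<R>> futures;
  futures.reserve(inputs.size());
  for (std::size_t i = 0; i < inputs.size(); ++i) {
    futures.push_back(std::async(std::launch::async, [&func, &inputs, i] {
      return func(i, inputs[i]);
    }));
  }
  for (auto& fut : futures) {
    results.push_back(fut.get());
  }
  return results;
}

// Toy stand-in for reading a file: each "row group" is a list of column
// values, and "decoding" a column doubles its value.
std::vector<std::vector<int>> DecodeAll(
    const std::vector<std::vector<int>>& row_groups) {
  return OptionalParallelMap(
      /*use_threads=*/true, row_groups,
      [](std::size_t, const std::vector<int>& columns) {
        // Inner level stays serial to avoid nested parallelism.
        return OptionalParallelMap(
            /*use_threads=*/false, columns,
            [](std::size_t, int value) { return value * 2; });
      });
}
```

Calling `DecodeAll` on a list of row groups returns the decoded groups in their original order. A real reader would submit tasks to a shared thread pool instead of spawning a thread per task; the sketch only shows how a single `use_threads` flag can select parallel execution at one level and serial execution at the other.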

lidavidm commented Jun 8, 2021

Member Author

[Figure: S3 Median Scan Time (s)]

The benchmark shows little difference overall; the most pronounced change is when the number of files is much smaller than the number of cores (this was a 4-vCPU machine), which I think makes sense, since with many files, file-level parallelism takes hold.

Comment thread cpp/src/arrow/util/parallel.h Outdated
Comment thread cpp/src/parquet/arrow/reader.cc Outdated
@pitrou pitrou self-requested a review June 15, 2021 13:53

@pitrou pitrou left a comment

Member

+1, thank you very much!

pitrou commented Jun 15, 2021

Member

Rebased, can merge if green.

@lidavidm lidavidm closed this in b73bcf0 Jun 15, 2021
@lidavidm lidavidm deleted the arrow-12597 branch June 15, 2021 15:22
sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Jun 23, 2021
…reader

This adds an OptionalParallelForAsync which lets us have per-row-group parallelism without nested parallelism in the async Parquet reader. This also uses TransferAlways, taking care of ARROW-12916. `enable_parallel_column_conversion` is kept as it still affects the threaded scanner.

Closes apache#10482 from lidavidm/arrow-12597

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>